

 phoneme sequence



Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

Tu, Huu Tuong, Khanh, Ha Viet, Dat, Tran Tien, Huan, Vu, Van Luong, Thien, Cuong, Nguyen Tien, Trang, Nguyen Thi Thu

arXiv.org Artificial Intelligence

Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or phoneme-level model training, we propose a novel training-free framework that leverages retrieval techniques with a pre-trained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling and additional task-specific training while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

Index Terms: mispronunciation detection and diagnosis, retrieval-based methods, training-free framework, automatic pronunciation assessment
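
The abstract does not spell out the retrieval procedure, but the general training-free idea of checking a learner's pronunciation against a canonical target can be illustrated with a simple phoneme alignment. The Python sketch below aligns the phoneme sequence recognized by a pre-trained ASR model against the canonical sequence and reports insertions, deletions, and substitutions as detected errors; the ARPAbet-style example is invented, and this is an illustration of the general idea rather than the authors' retrieval method.

```python
# Minimal sketch (not the authors' exact method): align the phoneme sequence
# recognized by a pre-trained ASR model against the canonical phoneme sequence
# of the prompt, and flag insertions/deletions/substitutions as mispronunciations.

def align_phonemes(canonical, recognized):
    """Levenshtein alignment with backtrace; returns a list of edit operations."""
    n, m = len(canonical), len(recognized)
    # dp[i][j] = edit distance between canonical[:i] and recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace to recover the operation at each aligned position.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])):
            ops.append(("match" if canonical[i - 1] == recognized[j - 1] else "substitution",
                        canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, recognized[j - 1]))
            j -= 1
    return list(reversed(ops))

# Example: a learner says "sink" for "think" -> /TH/ substituted by /S/.
canonical = ["TH", "IH", "NG", "K"]
recognized = ["S", "IH", "NG", "K"]
errors = [op for op in align_phonemes(canonical, recognized) if op[0] != "match"]
print(errors)  # [('substitution', 'TH', 'S')]
```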


Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication

Kim, Deok-Seon, Lee, Seo-Hyun, Yin, Kang, Lee, Seong-Whan

arXiv.org Artificial Intelligence

Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.
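
As a rough illustration of the kind of phoneme-level biosignal decoding the study describes, the PyTorch sketch below defines a small classifier that produces per-frame phoneme logits from EEG features, optionally fused with EMG features. All layer sizes, signal dimensions, and the fusion scheme are assumptions for illustration and are not the study's actual model.

```python
import torch
import torch.nn as nn

class BiosignalPhonemeDecoder(nn.Module):
    """Toy phoneme decoder over EEG features, with an optional EMG branch."""
    def __init__(self, eeg_dim=64, emg_dim=8, hidden=128, num_phonemes=40):
        super().__init__()
        self.eeg_branch = nn.GRU(eeg_dim, hidden, batch_first=True)
        self.emg_branch = nn.GRU(emg_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_phonemes)

    def forward(self, eeg, emg=None):
        # eeg: (B, T, eeg_dim); emg: optional (B, T, emg_dim)
        eeg_feat, _ = self.eeg_branch(eeg)
        if emg is None:
            emg_feat = torch.zeros_like(eeg_feat)   # EEG-only decoding
        else:
            emg_feat, _ = self.emg_branch(emg)      # EEG + EMG fusion
        # Per-frame phoneme logits; a downstream synthesizer would consume these.
        return self.head(torch.cat([eeg_feat, emg_feat], dim=-1))

model = BiosignalPhonemeDecoder()
eeg_only = model(torch.randn(2, 200, 64))                       # EEG alone
fused = model(torch.randn(2, 200, 64), torch.randn(2, 200, 8))  # EEG + EMG
```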



TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

Nguyen, Minh N. H., Tran, Anh Nguyen, Dinh, Dung Truong, Van Vo, Nam

arXiv.org Artificial Intelligence

Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and ambiguity between similar-sounding units are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, the Two-Stage Phoneme-Centric model (TSPC). TSPC employs a phoneme-centric approach built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 19.9% with reduced training resources. Furthermore, the phoneme-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex Vietnamese-English CS scenarios.
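
As a toy illustration of the two-stage phoneme-centric idea (not the TSPC implementation, whose models and phoneme inventory are not given in the abstract), the Python sketch below stubs out Stage 1 with a fixed phoneme output and implements Stage 2 as a greedy longest-match lexicon lookup that turns a mixed-lingual phoneme string into Vietnamese and English words. The phonemes and lexicon are rough, invented approximations.

```python
def stage1_speech_to_phonemes(audio_features):
    """Stub acoustic model: in TSPC this would be a neural speech-to-phoneme model."""
    # Pretend the model recognized the code-switched phrase "toi dung laptop".
    return ["t", "oj", "z", "u", "ng", "l", "a", "p", "t", "o", "p"]

def stage2_phonemes_to_text(phonemes, lexicon):
    """Greedy longest-match lookup standing in for the neural phoneme-to-text stage."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):   # try the longest span first
            key = tuple(phonemes[i:j])
            if key in lexicon:
                words.append(lexicon[key])
                i = j
                break
        else:
            i += 1                              # skip an unmappable phoneme
    return " ".join(words)

lexicon = {
    ("t", "oj"): "tôi",                         # Vietnamese word
    ("z", "u", "ng"): "dùng",                   # Vietnamese word
    ("l", "a", "p", "t", "o", "p"): "laptop",   # embedded English word
}
print(stage2_phonemes_to_text(stage1_speech_to_phonemes(None), lexicon))
# -> "tôi dùng laptop"
```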


Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Teng, Matthew Kit Khinn, Zhang, Haibo, Saitoh, Takeshi

arXiv.org Artificial Intelligence

Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. The task is notably difficult due to the absence of auditory cues and the visual ambiguity of phonemes that share similar visemes (distinct sounds that appear identical in lip motion). Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. To address these challenges, we propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by a language model for word reconstruction. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity; the facial landmark features account for speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder language model, NLLB, that reconstructs the output phonemes back into words. In addition to fine-tuning on a large visual dataset, our PV-ASR method demonstrates superior performance, achieving 17.4% WER on the LRS2 dataset and 21.0% WER on the LRS3 dataset.
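
A minimal PyTorch sketch of the Stage 1 idea is shown below: frame-level visual features and facial-landmark features are concatenated, encoded, and mapped to per-frame phoneme logits (with a CTC-style blank); Stage 2 would then pass the collapsed phoneme string to a sequence-to-sequence model such as NLLB for word reconstruction. Feature dimensions, layers, and the phoneme inventory size are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class PhonemePredictor(nn.Module):
    """Toy Stage 1: fuse visual and landmark streams, predict frame-level phonemes."""
    def __init__(self, visual_dim=512, landmark_dim=136, hidden=256, num_phonemes=40):
        super().__init__()
        self.fuse = nn.Linear(visual_dim + landmark_dim, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_phonemes + 1)  # +1 for the CTC blank

    def forward(self, visual, landmarks):
        # visual:    (B, T, visual_dim)  CNN features of lip crops
        # landmarks: (B, T, landmark_dim) flattened facial keypoints
        x = torch.relu(self.fuse(torch.cat([visual, landmarks], dim=-1)))
        x, _ = self.encoder(x)
        return self.head(x)  # (B, T, num_phonemes + 1) frame-level phoneme logits

model = PhonemePredictor()
logits = model(torch.randn(2, 75, 512), torch.randn(2, 75, 136))
phoneme_ids = logits.argmax(-1)  # greedy frame labels; Stage 2 would map the
                                 # collapsed phoneme sequence back to words.
```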


LLM-based phoneme-to-grapheme for phoneme-based speech recognition

Ma, Te, Bi, Min, Yusuyin, Saierdaer, Huang, Hao, Ou, Zhijian

arXiv.org Artificial Intelligence

In automatic speech recognition (ASR), phoneme-based multilingual pre-training with crosslingual fine-tuning is attractive for its high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST)-based decoding is limited by its complex pipeline and its inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G) stages. A challenge is that cascading S2P and P2G appears to introduce information loss. To address this, we propose two training strategies: data augmentation with noisy phonemes (DANP) and randomized top-K marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, with relative WER reductions of 3.6% and 6.9%, respectively.
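
The top-K marginalization mentioned above can be pictured as summing candidate-text scores over the S2P model's K-best phoneme hypotheses. The sketch below assumes hypothetical s2p_topk and p2g interfaces that return log-probabilities; it illustrates only the marginalization step, not the randomized training procedure or the actual LLM-P2G models.

```python
import math
from collections import defaultdict

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def tkm_decode(s2p_topk, p2g):
    """
    s2p_topk: list of (phoneme_sequence, log_prob) pairs from the S2P K-best list.
    p2g:      callable returning a list of (text, log_prob) pairs for a phoneme sequence.
    Returns the text whose score, marginalized over phoneme hypotheses, is highest.
    """
    scores = defaultdict(lambda: float("-inf"))
    for phonemes, lp_phon in s2p_topk:
        for text, lp_text in p2g(phonemes):
            # log p(phonemes | audio) + log p(text | phonemes), accumulated per text
            scores[text] = log_add(scores[text], lp_phon + lp_text)
    return max(scores, key=scores.get)
```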


PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense

Kim, Byungjun, Kim, Minju, Park, Hyeonchu, Kim, Bugeun

arXiv.org Artificial Intelligence

As malicious users increasingly employ phonetic substitution to evade hate speech detection, researchers have investigated such strategies. However, two key challenges remain. First, existing studies have overlooked the Korean language, despite its vulnerability to phonetic perturbations due to its phonographic nature. Second, prior work has primarily focused on constructing datasets rather than developing architectural defenses. To address these challenges, we propose (1) PHonetic-Informed Substitution for Hangul (PHISH) that exploits the phonological characteristics of the Korean writing system, and (2) Mixed Encoding of Semantic-pHonetic features (MESH) that enhances the detector's robustness by incorporating phonetic information at the architectural level. Our experimental results demonstrate the effectiveness of our proposed methods on both perturbed and unperturbed datasets, suggesting that they not only improve detection performance but also reflect realistic adversarial behaviors employed by malicious users.
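
A minimal sketch of the mixed semantic-phonetic encoding idea (not the authors' exact MESH architecture) is given below: a sentence-level semantic embedding from a pre-trained encoder is concatenated with a phonetic embedding computed over decomposed Hangul jamo, so phonetically perturbed spellings stay close to their unperturbed forms. Embedding sizes, the jamo vocabulary, and the encoders are placeholders.

```python
import torch
import torch.nn as nn

class MixedEncodingClassifier(nn.Module):
    """Toy detector that fuses semantic and jamo-level phonetic representations."""
    def __init__(self, semantic_dim=768, jamo_vocab=70, phon_dim=128, num_classes=2):
        super().__init__()
        self.jamo_embed = nn.Embedding(jamo_vocab, phon_dim, padding_idx=0)
        self.phon_encoder = nn.GRU(phon_dim, phon_dim, batch_first=True)
        self.classifier = nn.Linear(semantic_dim + phon_dim, num_classes)

    def forward(self, semantic_vec, jamo_ids):
        # semantic_vec: (B, semantic_dim) from a pre-trained text encoder
        # jamo_ids:     (B, L) indices of decomposed Hangul jamo characters
        _, h = self.phon_encoder(self.jamo_embed(jamo_ids))
        phonetic_vec = h[-1]                                   # (B, phon_dim)
        fused = torch.cat([semantic_vec, phonetic_vec], dim=-1)
        return self.classifier(fused)

model = MixedEncodingClassifier()
logits = model(torch.randn(4, 768), torch.randint(1, 70, (4, 30)))
```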


Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction

Lee, Jihwan, Feng, Tiantian, Kommineni, Aditya, Kadiri, Sudarsana Reddy, Narayanan, Shrikanth

arXiv.org Artificial Intelligence

Brain-computer interfaces (BCIs) offer numerous human-centered application possibilities, particularly for people with neurological disorders. Decoding text or speech from brain activity is a relevant domain that could improve the quality of life of people with impaired speech perception. We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals by utilizing an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences. The proposed model architecture consists of three main parts: an EEG module, a speech module, and a phoneme predictor. The EEG module learns to represent EEG signals as EEG embeddings. The speech module generates speech waveforms from the EEG embeddings. The phoneme predictor outputs the decoded phoneme sequences in the text modality. Our approach allows users to obtain decoded listened speech from EEG signals in both modalities (speech waveforms and textual phoneme sequences) simultaneously, eliminating the need for a concatenated sequential pipeline for each modality. The proposed approach also outperforms previous methods in both modalities. The source code and speech samples are publicly available.
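
The three-part layout described above can be sketched as a shared EEG encoder with two parallel heads, one producing speech-frame features and one producing phoneme logits. The PyTorch sketch below uses invented layer sizes and a simple convolutional EEG module purely for illustration; the paper's actual modules differ.

```python
import torch
import torch.nn as nn

class ParallelEEGDecoder(nn.Module):
    """Toy layout: shared EEG module with parallel speech and phoneme heads."""
    def __init__(self, eeg_channels=64, hidden=256, num_phonemes=40, frame_dim=80):
        super().__init__()
        self.eeg_module = nn.Sequential(
            nn.Conv1d(eeg_channels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.speech_module = nn.Linear(hidden, frame_dim)              # e.g. mel frames for a vocoder
        self.phoneme_predictor = nn.Linear(hidden, num_phonemes + 1)   # +1 for a CTC blank

    def forward(self, eeg):
        # eeg: (B, channels, T)
        emb = self.eeg_module(eeg).transpose(1, 2)    # (B, T, hidden) shared EEG embeddings
        speech_frames = self.speech_module(emb)       # decoded speech representation
        phoneme_logits = self.phoneme_predictor(emb)  # parallel textual phoneme sequence
        return speech_frames, phoneme_logits

model = ParallelEEGDecoder()
speech, phonemes = model(torch.randn(2, 64, 500))
```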


SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

Lu, Haitian, Cheng, Gaofeng, Luo, Liuping, Zhang, Leying, Qian, Yanmin, Zhang, Pengyuan

arXiv.org Artificial Intelligence

Recently, ``textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.